2.2 Estimation and convexity. In an estimation problem, we are given a family $\mathcal{P}$ of
laws (probability measures) on a sample space $(X,\mathcal{B})$, we observe a point $x$ of $X$, where
often $X$ is an $n$-fold product, $x = (X_1,\dots,X_n)$, and we want to estimate a real-valued
function $g$ on $\mathcal{P}$. Usually $\mathcal{P} = \{P_\theta,\ \theta\in\Theta\}$ and $g$ is written as a function of the
parameter(s), $g(P_\theta) = g(\theta)$.

Estimation problems are decision problems as treated in Secs. 1.2 and 1.3. In
estimation, if one is trying to estimate $\theta$ itself, the action space is equal to the parameter
space $\Theta$ or, possibly, a larger space including the parameter space, such as its closure if
$\Theta$ is not closed. In estimation of a function $g(\theta)$, the action space will be a subset of some
measurable space $S$ which includes the range of $g$, where $g$ is assumed measurable. The
loss function in a problem of estimating $g(\theta)$ will be assumed to satisfy $L(\theta,T) \ge 0$ and
$L(\theta,T) = 0$ if and only if $T = g(\theta)$. Often, for a metric (distance) $d$ defined on $S$, the loss
function $L(\theta,T)$ will be an increasing function of $d(T,g(\theta))$, being small when $T$ is close
to $g(\theta)$ and large when $T$ is far from $g(\theta)$. In any case, a (non-randomized) decision rule
for an estimation problem is a measurable function from $X^n$ into $S$, called an estimator.
A particular value of an estimator obtained for some given data is called an estimate. For
an estimator $T(X_1,\dots,X_n)$ we then have the risk



$$r(\theta, T(\cdot)) = E_\theta L(\theta, T(X_1,\dots,X_n)).$$

Here $E_\theta$ is the expectation when $X_1,\dots,X_n$ are i.i.d. $(P_\theta)$, so that

$$E_\theta := \int\cdots\int \,\cdot\; dP_\theta(X_1)\cdots dP_\theta(X_n).$$

The parameter spaces and spaces S statisticians have considered up to now have often 
been subsets of Euclidean spaces $\mathbb{R}^k$ with their usual norms $\|x\| := (x_1^2 + \cdots + x_k^2)^{1/2}$.
The loss function most treated in classical statistics has been $L(\theta,T) = \|T - g(\theta)\|^2$. In
one dimension, this is just the squared difference between the estimate and the quantity 
to be estimated, called "squared-error loss." This is not to say that in applications of 
statistics, individuals in fact suffer losses, even approximately, proportional to the squared 
errors. Rather, the theory based on squared-error losses has been easier to work out and 
is relatively traditional in statistics. It may be hoped that results for squared-error losses 
will shed light on the situation for other, possibly more realistic loss functions. 

For example, let $X = \Theta = \mathbb{R}$ and $P_\theta = N(\theta, 1)$, $\theta \in \mathbb{R}$. Then the classical estimator
for $\theta$ is $T(X_1,\dots,X_n) = \bar{X} := (X_1 + \cdots + X_n)/n$. It does minimize the mean squared-error
loss under some conditions, as will be seen later.
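
As an informal numerical check of the risk definition in this example, the sketch below estimates $r(\theta,\bar{X}) = E_\theta(\bar{X}-\theta)^2$ by simulation. The function name `mc_risk_sample_mean`, the number of replications and the seed are illustrative choices, not part of the text. Since $\bar{X}$ has variance $1/n$ under $N(\theta,1)$ sampling, the simulated values should be close to $1/n$.

```python
import random

def mc_risk_sample_mean(theta, n, reps=20000, seed=0):
    """Monte Carlo estimate of E_theta (Xbar - theta)^2 for X_i i.i.d. N(theta, 1)."""
    rng = random.Random(seed)  # fixed seed (an assumption) so the run is reproducible
    total = 0.0
    for _ in range(reps):
        xbar = sum(rng.gauss(theta, 1.0) for _ in range(n)) / n
        total += (xbar - theta) ** 2  # squared-error loss for this sample
    return total / reps

for n in (1, 4, 16):
    # the exact risk is Var(Xbar) = 1/n, whatever theta is
    print(n, round(mc_risk_sample_mean(theta=2.0, n=n), 4), 1.0 / n)
```

The simulation only approximates the integral defining $E_\theta$; the agreement with $1/n$ is up to Monte Carlo error.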

Testing between two simple hypotheses $P$ and $Q$, as in Secs. 1.1 and 1.5-1.7, is a
special case of estimation in which the parameter space has only two points. But note
that in that case, if the losses $L_{PQ}$ and $L_{QP}$ are different, the loss is not any function of
a metric $d$ on $\mathcal{P} = \{P, Q\}$, which would satisfy $d(P, Q) = d(Q, P)$.

Admissibility for estimators (for a given loss function) is defined from the general 
definition for decision rules. Likewise, if a prior is given on the parameter space, an 
estimator which minimizes the overall risk is called a Bayes estimator. 
For a fixed $\phi \in \Theta$, consider the constant estimator $T \equiv g(\phi)$. This makes $r(\phi, T) = 0$,
a minimum. If this $T$ is not admissible, there is some other statistic $U$ whose risk is no
larger than that of $T$ for every $\theta$, so that $r(\phi, U) = 0$ and hence
$U(X_1,\dots,X_n) = g(\phi)$ almost surely for $P_\phi$. This implies that for any $\theta \in \Theta$ such that $P_\theta$
is absolutely continuous with respect to $P_\phi$, we also have $U = g(\phi)$ almost surely for $P_\theta$,
so that $U$ cannot distinguish $\theta$ from $\phi$. If all the $P_\theta$ are equivalent (mutually absolutely
continuous), for example if each $P_\theta = N(\mu, \sigma^2)$ for some $\mu \in \mathbb{R}$ and $\sigma > 0$, then such a $U$
equals $T$ almost surely for every $P_\theta$ and cannot have smaller risk anywhere, so the constant $T$
will always be admissible.

This trivial estimator $T$ is admissible because it does well when $g(\theta) = g(\phi)$. For
other $\theta$, $T$ will tend to do badly. Unlike reasonable estimators, the estimator $T$ does not
provide a better approximation to the true $g(\theta)$ as $n$ increases. So it will not be enough
for an estimator to be admissible; some other good properties must be sought. One way 
to rule out such bad behavior as that of constant estimators is the following property: 
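
To make the contrast concrete, here is a small sketch for the family $N(\theta,1)$ with $g(\theta)=\theta$ and squared-error loss. The helper names `risk_constant` and `risk_sample_mean` are hypothetical. The constant estimator $T \equiv \phi$ has risk $(\phi-\theta)^2$, which does not depend on $n$, while the sample mean has risk $1/n$ for every $\theta$.

```python
def risk_constant(theta, phi):
    """Squared-error risk at theta of the constant estimator T = phi, for g(theta) = theta."""
    return (phi - theta) ** 2

def risk_sample_mean(n):
    """Squared-error risk of the sample mean for N(theta, 1) data; the same for every theta."""
    return 1.0 / n

phi = 0.0  # the parameter value at which the constant estimator is exactly right
for theta in (0.0, 0.5, 2.0):
    # columns: theta, risk of the constant estimator, risks of the mean for n = 1, 10, 100
    print(theta, risk_constant(theta, phi), [risk_sample_mean(n) for n in (1, 10, 100)])
```

At $\theta=\phi$ the constant estimator cannot be beaten, which is why it is admissible, but for any other $\theta$ the sample mean eventually has smaller risk as $n$ grows.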

Definition. If $g$ is a function from $\Theta$ into some $\mathbb{R}^k$, a statistic $T(X_1,\dots,X_n)$ with values
in $\mathbb{R}^k$ is called an unbiased estimator of $g$ iff for all $\theta \in \Theta$, $E_\theta T = g(\theta)$.

The sample mean $\bar{X}$ is evidently unbiased as an estimator of the true mean for any
distribution having a finite mean. Constant estimators of a non-constant function $g(\theta)$ will
not be unbiased. A requirement that estimators be unbiased can, however, lead to bad 
estimators as in the following: 

Example. A function $g(\theta)$ may have a unique unbiased estimator which is inadmissible.
For $\lambda > 0$, let $\mathcal{L}_\lambda$ be the Poisson distribution with parameter $\lambda$, conditioned on $k \ge 1$, in
other words

$$\mathcal{L}_\lambda(k) := e^{-\lambda}\lambda^k/(k!\,(1-e^{-\lambda})) \quad\text{for } k = 1,2,\dots.$$

Let $g(\lambda) := e^{-\lambda}$. For $V(\cdot)$ to be an unbiased estimator of $g$ we have $\sum_{k\ge 1} \lambda^k V(k)/k! =
1 - e^{-\lambda}$ for all $\lambda > 0$. Comparing power series coefficients gives $V(k) = (-1)^{k+1}$. This
is the unique unbiased estimator of $e^{-\lambda}$. For $k$ even, $V(k) = -1$, a bad estimate for $e^{-\lambda}$
which is positive. For any loss function which increases as the estimate moves away from
the true value, it will reduce the risk to replace $-1$ by $0$. Even then, the choice between
the maximum value $1$ and the minimum value $0$ for the estimator based on the parity of
$k$ seems unreasonable since larger values of $k$ indicate larger values of $\lambda$ and so smaller
values of $e^{-\lambda}$. So, at least for large values of $k$, $e^{-k}$ would be a much more reasonable
estimate of $e^{-\lambda}$.
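
The claims in this example can be checked numerically. The sketch below is only an illustration: the function names and the truncation of the series at `kmax = 100` are assumptions, adequate for moderate $\lambda$. It verifies that $V(k) = (-1)^{k+1}$ has expectation $e^{-\lambda}$ under $\mathcal{L}_\lambda$, and compares its squared-error risk with that of the modified estimator obtained by replacing $-1$ by $0$.

```python
import math

def truncated_poisson_pmf(k, lam):
    """Mass at k of the Poisson(lam) law conditioned on k >= 1."""
    return math.exp(-lam) * lam ** k / (math.factorial(k) * (1.0 - math.exp(-lam)))

def V(k):
    """The unique unbiased estimator of exp(-lam) from the example."""
    return (-1) ** (k + 1)

def V_plus(k):
    """V with the value -1 replaced by 0 (no longer unbiased)."""
    return max(V(k), 0)

def expectation(estimator, lam, kmax=100):
    # series truncated at kmax (an assumption; the tail is negligible for moderate lam)
    return sum(estimator(k) * truncated_poisson_pmf(k, lam) for k in range(1, kmax + 1))

def risk(estimator, lam, kmax=100):
    """Squared-error risk for estimating exp(-lam), with the same truncation."""
    g = math.exp(-lam)
    return sum((estimator(k) - g) ** 2 * truncated_poisson_pmf(k, lam)
               for k in range(1, kmax + 1))

for lam in (0.5, 2.0, 5.0):
    print(lam, expectation(V, lam), math.exp(-lam), risk(V, lam), risk(V_plus, lam))
```

For each $\lambda$ the second and third printed columns should agree, and the last column should be smaller than the one before it, in line with the inadmissibility of $V$.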

A sequence $T_n = T_n(X_1,\dots,X_n)$ of estimators for some $g(\theta)$, where $T_n$ and $g$ take
values in a space $S$ with a metric $d$, is called consistent if $T_n \to g(\theta)$ in probability as
$n \to \infty$ for each $P_\theta$, that is, for each $\varepsilon > 0$ and each $\theta$,

$$\lim_{n\to\infty} P_\theta^n\{d(T_n(X_1,\dots,X_n), g(\theta)) > \varepsilon\} = 0.$$

$\{T_n\}$ are called strongly consistent iff for each $\theta$, $T_n(X_1,\dots,X_n)$ converge to $g(\theta)$ as
$n \to \infty$ almost surely for $P_\theta^\infty$.

For example, if $S = X = \mathbb{R}$ and $g(\theta) = \int x\,dP_\theta$, where $\int |x|\,dP_\theta < \infty$ for all $\theta$, and
$T_n = \bar{X} = (X_1 + \cdots + X_n)/n$, then $T_n$ are strongly consistent for $g(\theta)$ by the strong law of
large numbers (RAP, Theorem 8.3.5).
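
For the special case $P_\theta = N(\theta,1)$ (an assumption made here only to get a closed form), the probability in the definition of consistency can be written down exactly, since $\bar{X}$ has law $N(\theta,1/n)$. The sketch below, with the hypothetical name `prob_mean_off_by_more_than`, evaluates $P_\theta\{|\bar{X}-\theta|>\varepsilon\}$ for increasing $n$.

```python
import math

def prob_mean_off_by_more_than(eps, n):
    """P_theta(|Xbar_n - theta| > eps) for X_i i.i.d. N(theta, 1), so Xbar_n ~ N(theta, 1/n)."""
    z = eps * math.sqrt(n)                # standardized threshold
    return math.erfc(z / math.sqrt(2.0))  # P(|Z| > z) for a standard normal Z

for n in (4, 25, 100, 400):
    print(n, prob_mean_off_by_more_than(0.2, n))
```

The printed probabilities decrease toward $0$ as $n$ grows, as the definition requires; this illustrates consistency (convergence in probability) only, not the almost sure convergence in strong consistency.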



As mentioned, one much-used loss function for real-valued $g(\theta)$ has been squared-error
loss $(T - g(\theta))^2$. Perhaps the next most often considered is the absolute deviation
$|T - g(\theta)|$. Both these functions are convex functions $f$ of $T - g(\theta)$, defined as follows. A
set $C \subset \mathbb{R}$ is convex if for any $u,v \in C$ and $0 \le t \le 1$, we have $tu + (1-t)v \in C$. Then a
real-valued function $f$ on $C$ is convex if for any $u,v \in C$ and $0 \le t \le 1$,

$$f(tu+(1-t)v) \le t f(u) + (1-t) f(v).$$

Then $f$ is continuous on the interior of $C$ (RAP, Theorem 6.3.4) but not necessarily on the
boundary of $C$. A basic fact about convex functions and probability is Jensen's inequality
(RAP, 10.2.6), which says that if $X$ is a random variable having expectation $EX$ and $f$ is
a convex function then $f(EX) \le Ef(X)$ (under suitable measurability conditions). When
$X$ has just two values $u$ and $v$, with $P(X = u) = t = 1 - P(X = v)$, Jensen's inequality
reduces to the definition of convexity.
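
The two-point case can be checked directly. In the following sketch (the name `jensen_gap` is hypothetical), the quantity $Ef(X) - f(EX)$ is computed for $P(X=u)=t=1-P(X=v)$ and should be nonnegative for the two convex loss shapes mentioned above, $f(v)=v^2$ and $f(v)=|v|$.

```python
def jensen_gap(f, u, v, t):
    """E f(X) - f(EX) when P(X = u) = t and P(X = v) = 1 - t."""
    mean = t * u + (1 - t) * v
    return t * f(u) + (1 - t) * f(v) - f(mean)

def square(v):
    return v * v

# a small grid of two-point laws; every gap should be >= 0 up to rounding
grid = [(-1.0, 2.0, 0.25), (0.5, 3.0, 0.5), (-2.0, -0.5, 0.9)]
for u, v, t in grid:
    print(u, v, t, jensen_gap(square, u, v, t), jensen_gap(abs, u, v, t))
```

For $f(v) = v^2$ the gap equals $t(1-t)(u-v)^2$, the variance of $X$, so it vanishes only when $u=v$ or $t \in \{0,1\}$.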

When loss functions are convex, decision rules can be improved or simplified in some 
ways. For one, randomization can be dispensed with, in rather general decision problems, 
as follows: 

2.2.1 Theorem. If in a decision problem the action space $A$ is a Borel measurable convex
subset of some Euclidean space $\mathbb{R}^k$, and the loss function $L(\theta,\cdot)$ for any fixed $\theta \in \Theta$ is
convex and Borel measurable on $A$, then if $d$ is any randomized decision rule such that
$\int \|u\|\,d(x)(du) < \infty$ for $P_\theta$-almost all $x$ for all $\theta$, $d$ can be replaced by a non-randomized
rule $b(\cdot)$ without increasing the risk for any $\theta$.

Proof. Let $b(x) := \int_A u\,d(x)(du)$, meaning that the (vector-valued) identity function
$u \mapsto u$ on $A$ is integrated with respect to the law $d(x)(\cdot)$. The integral is well-defined for
$P_\theta$-almost all $x$ for all $\theta$, and is a measurable non-randomized decision rule by Proposition
1.2.7. Then by Jensen's inequality (RAP, 10.2.6), for any such $x$,

$$L(\theta,b(x)) \le \int L(\theta,u)\,d(x)(du).$$

Integrating with respect to $P_\theta$ gives

$$r(\theta,b) = \int L(\theta,b(x))\,dP_\theta(x) \le \int\!\!\int L(\theta,u)\,d(x)(du)\,dP_\theta(x) = r(\theta,d),$$

finishing the proof. □
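
A toy instance of the construction in the proof may help. The setting below is an assumption chosen for illustration: one Bernoulli($p$) observation, action space $[0,1]$, and a hypothetical randomized rule `RANDOMIZED_RULE` whose law $d(x)$ has finite support. The sketch computes $b(x) = \int u\,d(x)(du)$ and compares the two risks exactly.

```python
def bernoulli_pmf(x, p):
    return p if x == 1 else 1.0 - p

def squared_loss(p, a):
    return (a - p) ** 2

# hypothetical randomized rule: d(x) is a law on actions, stored as {action: probability}
RANDOMIZED_RULE = {0: {0.0: 0.5, 0.5: 0.5}, 1: {0.5: 0.5, 1.0: 0.5}}

def derandomize(rule):
    """b(x) := integral of u with respect to d(x)(du), here a finite weighted sum."""
    return {x: sum(a * w for a, w in law.items()) for x, law in rule.items()}

def risk_randomized(p, rule, loss=squared_loss):
    """r(p, d): average the loss over the action law d(x), then over x."""
    return sum(bernoulli_pmf(x, p) * sum(w * loss(p, a) for a, w in law.items())
               for x, law in rule.items())

def risk_nonrandomized(p, b, loss=squared_loss):
    """r(p, b) for a non-randomized rule given as a table x -> action."""
    return sum(bernoulli_pmf(x, p) * loss(p, b[x]) for x in b)

b = derandomize(RANDOMIZED_RULE)
for p in (0.1, 0.5, 0.9):
    print(p, risk_nonrandomized(p, b), risk_randomized(p, RANDOMIZED_RULE))
```

With squared-error loss the difference $r(p,d) - r(p,b)$ is the average variance of the laws $d(x)$, here $0.0625$ for every $p$; passing another convex loss through the `loss` argument, such as absolute error, should still give $r(p,b) \le r(p,d)$.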

Next, estimators can be taken measurable for a sufficient $\sigma$-algebra without increasing
any risk:

2.2.2 Theorem (Rao-Blackwell). Let $\mathcal{A}$ be a sufficient $\sigma$-algebra for a family $\mathcal{P}$ of laws
in a decision problem satisfying the conditions in Theorem 2.2.1. Suppose $U$ is any
non-randomized decision rule with $\int \|U\|\,dP < \infty$ for every $P \in \mathcal{P}$. Then $T := E(U|\mathcal{A})$ is a
non-randomized decision rule with $r(P, T) \le r(P, U)$ for each $P \in \mathcal{P}$.

Proof. By Theorem 2.1.8, under the assumptions, $E(U|\mathcal{A})$ exists and doesn't depend on
$P \in \mathcal{P}$ (conditional expectations of vector-valued functions can be taken coordinatewise).
Thus $r(P,T)$ is well-defined. If $r(P,U) = +\infty$ there is nothing to prove, so we can
restrict to the class of $P \in \mathcal{P}$ for which $r(P,U) < \infty$. Thus $L(P,U) \in \mathcal{L}^1(P)$ and has
a conditional expectation $E_P(L(P,U)|\mathcal{A})$. By the conditional Jensen inequality (RAP,
10.2.7), $E_P(L(P,U)|\mathcal{A}) \ge L(P,T)$ a.s. for each $P \in \mathcal{P}$. Integrating both sides with respect
to $P$ gives the result. □

The Rao-Blackwell theorem applies to estimation problems where the loss function is
a convex function $W$ of $Y - g(P)$ for an estimator $Y$, such as $W(v) = |v|$ or $W(v) = v^2$.
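
As an illustration of Theorem 2.2.2 in a setting small enough to enumerate, take $X_1,\dots,X_n$ i.i.d. Bernoulli($p$) with $0<p<1$, the crude unbiased estimator $U = X_1$, and the sufficient statistic $S = X_1+\cdots+X_n$. This setting and all function names below are assumptions made for the sketch, not taken from the text. The code computes $E(U|S)$ by direct enumeration and compares squared-error risks.

```python
from itertools import product

def prob(x, p):
    """Probability of the 0-1 sample x under i.i.d. Bernoulli(p)."""
    s = sum(x)
    return p ** s * (1 - p) ** (len(x) - s)

def rao_blackwellize(U, n, p):
    """Table s -> E(U | S = s), computed under P_p by enumerating all 2^n samples."""
    num, den = {}, {}
    for x in product((0, 1), repeat=n):
        s, w = sum(x), prob(x, p)
        num[s] = num.get(s, 0.0) + U(x) * w
        den[s] = den.get(s, 0.0) + w
    return {s: num[s] / den[s] for s in num}

def risk(estimator, n, p):
    """Squared-error risk for estimating p, by exact enumeration."""
    return sum(prob(x, p) * (estimator(x) - p) ** 2 for x in product((0, 1), repeat=n))

def first_coordinate(x):
    return x[0]

n = 4
table = rao_blackwellize(first_coordinate, n, 0.3)

def T(x):
    return table[sum(x)]  # the Rao-Blackwellized estimator, a function of S only

print(table)
for p in (0.3, 0.7):
    print(p, risk(first_coordinate, n, p), risk(T, n, p))
```

The table comes out as $s \mapsto s/n$ whichever $p \in (0,1)$ is used to compute it, reflecting sufficiency, so $T$ is the sample mean and its risk $p(1-p)/n$ is below the risk $p(1-p)$ of $U$.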

Essentially complete and complete classes of decision rules were defined in Sec. 1.2. 
These notions will also be defined here relative to any given classes of decision rules. Let 
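$\mathcal{U}$ be a class of decision rules and $\mathcal{P}$ a class of laws on a sample space. A class $\mathcal{D} \subset \mathcal{U}$ will
be called essentially complete for $\mathcal{P}$ relative to $\mathcal{U}$ iff for each $U \in \mathcal{U}$ there is some $d \in \mathcal{D}$
such that for each $P \in \mathcal{P}$, $r(P, d) \le r(P, U)$.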

2.2.3 Corollary. If $\mathcal{A}$ is a sufficient $\sigma$-algebra for $\mathcal{P}$, then for any action space $A$ which is
a convex, Borel measurable subset of some $\mathbb{R}^k$ and any loss function $L$ which is convex and
Borel measurable on $A$ for each $P \in \mathcal{P}$, the $\mathcal{A}$-measurable statistics form an essentially
complete class for $\mathcal{P}$ relative to the class of all decision rules $U$ such that $\int \|U\|\,dP < \infty$
for all $P \in \mathcal{P}$. Also, the family of unbiased $\mathcal{A}$-measurable estimators of a function $g(P)$ is
essentially complete relative to the class of all unbiased estimators.

Proof. We need only apply the definitions and Theorem 2.2.2, and note that if $U$ is an
unbiased estimator, so is $E(U|\mathcal{A})$ for any sub-$\sigma$-algebra $\mathcal{A}$. □

If an estimator $U$ for $g(P)$ has $\int \|U\|\,dP = +\infty$ for some $P$, then $E\|U - g(P)\| = +\infty$
and $E(\|U - g(P)\|^2) \ge [E\|U - g(P)\|]^2 = +\infty$, so there would be infinite risk for $P$
for both of the two most-studied loss functions. Such a $U$ seems quite undesirable, so the
hypothesis on $U$ in the first half of Corollary 2.2.3 seems not too restrictive.

Note. If an unbiased estimator performs badly, as in the example of estimating $e^{-\lambda}$ for
a Poisson variable $X$ observed only for $X \ge 1$ earlier in this section, or in problem 5(b),
then conditioning on a sufficient $\sigma$-algebra will not necessarily make the estimator a good
one.

PROBLEMS 

1. Let $X_1,\dots,X_n$ be i.i.d. from a distribution with a finite variance $\sigma^2$. Let

$$s^2 := (n-1)^{-1}\sum_{j=1}^{n}(X_j - \bar{X})^2 \quad\text{and}\quad s'^2 := (n-1)s^2/n \quad\text{for } n \ge 2.$$

Show that the sequences of estimators for $\sigma^2$ defined by $s^2$ and $s'^2$ are (a) both consistent,
while (b) only $s^2$ is unbiased ($n \ge 2$).

(c) Show that for $n = 1$, there is an unbiased estimator of $\sigma^2$ for Poisson distributions.

2. In problem 1, show that for $n = 1$ there is no unbiased estimator of $\sigma^2$ for general
distributions, specifically for normal distributions. Hints: Suppose there is a measurable
function $f$ on $\mathbb{R}$ such that $Ef(X) = \sigma^2$ whenever $X$ has a $N(\mu,\sigma^2)$ distribution. Let
$Y, Z$ be i.i.d. $N(0, 1)$. Find $Ef(Y + Z)$ and the conditional expectation $E(f(Y + Z)|Y)$
for each $Y$, whose expectation with respect to $Y$ should equal $Ef(Y + Z)$.

3. Give an example to show that when $T$ is an unbiased estimator of some $g(\theta)$, $T^2$ may
not be an unbiased estimator for $g(\theta)^2$. Find under what conditions, if any, $T^2$ will also
be unbiased for $g(\theta)^2$.

4. Give an example where a constant c is in the range of the function g on the parameter 
space but T = c is not an admissible estimator. Hint: consider families of laws which 
are not equivalent, e.g. laws $U[\theta, \theta + 1]$ on $\mathbb{R}$ for $-\infty < \theta < \infty$, with squared-error loss
and $g(\theta) = \theta$.

5. For a given $n$ consider the family of binomial distributions $b(k, n, p) := \binom{n}{k} p^k (1-p)^{n-k}$
for $k = 0, 1, \dots, n$. Here the sample space consists of the integers $0, 1, \dots, n$ and the
parameter is $p$.

(a) Show that a function $g(p)$ has an unbiased estimator if and only if $g$ is a polynomial
of degree $\le n$, and then the unbiased estimator is unique.

(b) Show that the unbiased estimator for $g(p) = p^2$ is $0$ not only when $k = 0$ but also
when $k = 1$.

6. Let $\mathcal{P}_1$ be the class of all laws on $\mathbb{R}$ with finite mean. Let $X_1,\dots,X_n$ be i.i.d. with a
distribution in $\mathcal{P}_1$. For any $c_1,\dots,c_n$ such that $\sum_{j=1}^{n} c_j = 1$, $T := \sum_{j=1}^{n} c_j X_j$ is an
unbiased estimator of $\mu = EX_1$. For the $\sigma$-algebra $\mathcal{S}_n$ of permutation-invariant (symmetric)
measurable events in $\mathbb{R}^n$, as in Problem 5 of Section 2.1, show that $E(T|\mathcal{S}_n) = \bar{X}$.

7. Let $X_1,\dots,X_5$ be observed, i.i.d. $N(\mu, 1)$ for $\mu$ unknown. Let $S_n := X_1 + \cdots + X_n$
for $n = 1,\dots,5$. Show that $S_5$ is a sufficient statistic for $\mu$. Let $U :=$ As an
example of the Rao-Blackwell theorem 2.2.2, find $T = E(U|S_5)$, where the conditional
expectation given a function is the same as the conditional expectation given the smallest
$\sigma$-algebra for which it is measurable. For squared-error loss, evaluate the risks $r(P,T)$
and $r(P, U)$ for any law $P = N(\mu, 1)$.

NOTES 

The example of Poisson distributions conditioned on $k \ge 1$ in showing what can go
wrong with unbiased estimation was found in Kendall and Stuart, 1967, p. 34. At this 
writing the latest edition, the 6th, is Kendall, Stuart and Ord (1994-1999), see vol. 2A. 